
Showing posts with label Big Data. Show all posts

Friday, July 10, 2015

Accessing Big Data using DI tools

by Unknown  |  in DI at  3:04 PM
Companies are investing heavily to understand the data they have accumulated over many years and the value it can potentially provide.
Hadoop plays a major role in processing and handling Big Data. Hadoop's HDFS is simply a file system in which data files are distributed across multiple computer systems (nodes).
A Hadoop cluster is a set of computer systems which function as the file system.
A single file in Hadoop can be spread over any number of nodes in the Hadoop cluster.
In theory, there is no limit to the amount of data which the file system can store since it is always possible to add more nodes.
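The block distribution described above can be illustrated with a toy model (this is not HDFS itself; the block size and node names are made up):

```python
# Toy model of distributed block placement: split a file's bytes into
# fixed-size blocks and spread them round-robin across cluster nodes.
def distribute(data, block_size, nodes):
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return {i: (nodes[i % len(nodes)], block) for i, block in enumerate(blocks)}

placement = distribute(b"0123456789", block_size=4, nodes=["node1", "node2"])
# blocks 0 and 2 land on node1, block 1 on node2
```

Adding more nodes to the list simply gives the round-robin assignment more targets, which is the intuition behind "no limit" to capacity.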

Datastage :-
DataStage has a stage called the Big Data File stage (BDFS) which allows DataStage to read from and write to Hadoop.
Before we can use this stage in a DataStage job, we have to configure the environment correctly. The following prerequisites have to be met:

Verify that the Hadoop (BigInsights) cluster is up and running correctly. The status of BigInsights can be checked either from the BigInsights console or from the command line.
Add the BigInsights library path to the dsenv file.
Find out the required connection details to the BigInsights cluster.
BDFS Cluster Host
BDFS Cluster Port Number
BDFS User: User name to access files
BDFS Group: Group name for permissions – Multiple groups can be listed.
The Big Data File stage functions similarly to the Sequential File stage. It can be used as either a source or a target in a job. Other than the required connection properties to HDFS, the stage has exactly the same properties as the Sequential File stage (e.g. First line is column names, Reject mode, Write mode, etc.)


Informatica:-
            Informatica has a handful of Big Data products which allow Informatica customers to process and access data in a Hadoop environment.
Power Exchange Connector:-
            PowerExchange includes a built-in Hadoop connector which allows you to connect to Hadoop directly.
Informatica Big Data Edition:-
            This edition provides an extensive library of prebuilt transformation capabilities on Hadoop, including
data type conversions and string manipulations, high performance cache-enabled lookups, joiners, sorters,
routers, aggregations, and many more.

Other functionality provided:-
·         Data profiling on Hadoop
·         Data Parsing

·         Entity Extraction and Data Classification

Monday, May 18, 2015

What exactly is Data Lineage? and Data Lineage tools

by Unknown  |  in DW at  6:24 PM
Metadata management has become a key area for companies, to keep track of information passing through many gates, to understand its value, and to see how it changes from one BU to another.
Why Data Lineage :-
Let's imagine that a user has complained about a customer having multiple records with different customer information. Understanding the root cause of this, in the current world without metadata information, involves SMEs (subject matter experts) and takes ages of backtracking.
Data lineage can answer these questions, tracing the data path (its "lineage") upstream from target back to source, capturing the original source, the data flow transformations, and the target information as well.
How to Track :-
This lineage should be presented in a visual format, preferably with options for viewing at a summary level with an option to drill down for individual column and process details.
Knowing the original source, and understanding “what happens” to the data as it flows to a report helps boost confidence in the results and the overall business intelligence infrastructure. 
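Underneath the visual presentation, such tracing is a walk over a graph of column dependencies. A minimal sketch (the column and system names here are invented for illustration):

```python
# Each target column records its immediate upstream sources; tracing walks
# the graph back to the original system.
lineage = {
    "report.revenue": ["dw.fact_sales.amount"],
    "dw.fact_sales.amount": ["staging.orders.amt_usd"],
    "staging.orders.amt_usd": ["crm.orders.amount"],
}

def trace_upstream(column, graph):
    path = [column]
    while graph.get(column):
        column = graph[column][0]  # follow the first upstream source
        path.append(column)
    return path

path = trace_upstream("report.revenue", lineage)
# report.revenue <- dw.fact_sales.amount <- staging.orders.amt_usd <- crm.orders.amount
```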




Data Lineage Tools :-

Ab Initio
Analytixds
Data Lineage 
DQ On Demand
IBM Ascential
IBM InfoSphere Business Information Exchange
IBM InfoSphere Metadata Workbench
Informatica MetaData Manager (PDF download)
Informatica On Demand
Microsoft SSIS
Oracle DW Builder
Talend
Uniserv

Tuesday, June 24, 2014

File Archive using Hadoop Archive

by Unknown  |  in Big Data at  5:28 AM

 Archiving small files
The Hadoop Archive's data format is called har, with the following layout:
foo.har/_masterindex //stores hashes and offsets
foo.har/_index //stores file statuses
foo.har/part-[1..n] //stores actual file data
The file data is stored in multiple part files, which are indexed for keeping the original separation of data intact. Moreover, the part files can be accessed by MapReduce programs in parallel. The index files also record the original directory tree structures and the file statuses. In Figure 1, a directory containing many small files is archived into a directory with large files and indexes.
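The indexing idea can be sketched in a few lines of Python. This is a simplified illustration of the layout only, not the actual har format:

```python
# Pack small files into one "part" byte string plus an index of
# (offset, length) per original file name; reads consult the index only.
def pack(files):
    part, index, offset = b"", {}, 0
    for name, data in files.items():
        index[name] = (offset, len(data))
        part += data
        offset += len(data)
    return part, index

def read(part, index, name):
    offset, length = index[name]
    return part[offset:offset + length]

part, index = pack({"a.txt": b"alpha", "b.txt": b"beta"})
```

Many small files become one large part file, yet each original file is still individually addressable, which is exactly what keeps the namenode's metadata small.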
HarFileSystem – A first-class FileSystem providing transparent access
Most archival systems, such as tar, are tools for archiving and de-archiving. Generally, they do not fit into the actual file system layer and hence are not transparent to the application writer, in that the user has to de-archive the archive before use.
Hadoop Archive is integrated into Hadoop's FileSystem interface. The HarFileSystem implements the FileSystem interface and provides access via the har:// scheme. This exposes the archived files and directory tree structures transparently to users. Files in a har can be accessed directly without expanding it. For example, we have the following command to copy an HDFS file to a local directory:
hadoop fs -get hdfs://namenode/foo/file-1 localdir
Suppose an archive bar.har is created from the foo directory. Then, the command to copy the original file becomes
hadoop fs -get har://namenode/bar.har#foo/file-1 localdir
Users only have to change the URI paths. Alternatively, users may choose to create a symbolic link (from hdfs://namenode/foo to har://namenode/bar.har#foo in the example above); then even the URIs do not need to be changed. In either case, HarFileSystem will be invoked automatically to provide access to the files in the har. Because of this transparent layer, har is compatible with the Hadoop APIs, MapReduce, the shell command-line interface, and higher-level applications like Pig, Zebra, Streaming, Pipes, and DistCp.

Tuesday, May 13, 2014

Big Data Technologies

by Unknown  |  in Big Data at  2:08 AM
Column-oriented databases
Traditional, row-oriented databases are excellent for online transaction processing with high update speeds, but they fall short on query performance as the data volumes grow and as data becomes more unstructured. Column-oriented databases store data with a focus on columns, instead of rows, allowing for huge data compression and very fast query times. The downside to these databases is that they will generally only allow batch updates, having a much slower update time than traditional models.
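A toy example of why the columnar layout helps aggregate queries (illustrative only):

```python
# The same table stored row-wise and column-wise; an aggregate over one
# column scans a single contiguous list in the columnar layout.
rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}, {"id": 3, "amount": 5}]
columns = {"id": [1, 2, 3], "amount": [10, 25, 5]}

row_total = sum(r["amount"] for r in rows)  # touches every whole row
col_total = sum(columns["amount"])          # touches one column only
```

Both totals are of course equal; the difference is that the columnar layout reads only the data the query needs, and a run of similar values in one column also compresses far better than interleaved rows.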
Schema-less databases, or NoSQL databases
There are several database types that fit into this category, such as key-value stores and document stores, which focus on the storage and retrieval of large volumes of unstructured, semi-structured, or even structured data. They achieve performance gains by doing away with some (or all) of the restrictions traditionally associated with conventional databases, such as read-write consistency, in exchange for scalability and distributed processing.
MapReduce
This is a programming paradigm that allows for massive job execution scalability against thousands of servers or clusters of servers. Any MapReduce implementation consists of two tasks:
  • The "Map" task, where an input dataset is converted into a different set of key/value pairs, or tuples;
  • The "Reduce" task, where several of the outputs of the "Map" task are combined to form a reduced set of tuples (hence the name).
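The two tasks above can be sketched as a single-process word count (a conceptual illustration, not a distributed implementation):

```python
from collections import defaultdict

def map_task(line):
    # emit one (word, 1) key/value pair per word
    return [(word, 1) for word in line.split()]

def reduce_task(pairs):
    # combine the map outputs by key into a reduced set of tuples
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

pairs = [kv for line in ["big data", "big deal"] for kv in map_task(line)]
counts = reduce_task(pairs)  # {'big': 2, 'data': 1, 'deal': 1}
```

In a real cluster, many map tasks run in parallel over different input splits, and the framework shuffles pairs with the same key to the same reduce task.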
Hadoop
Hadoop is by far the most popular implementation of MapReduce, being an entirely open source platform for handling Big Data. It is flexible enough to be able to work with multiple data sources, either aggregating multiple sources of data in order to do large scale processing, or even reading data from a database in order to run processor-intensive machine learning jobs. It has several different applications, but one of the top use cases is for large volumes of constantly changing data, such as location-based data from weather or traffic sensors, web-based or social media data, or machine-to-machine transactional data.
Hive
Hive is a "SQL-like" bridge that allows conventional BI applications to run queries against a Hadoop cluster. It was developed originally by Facebook, but has been made open source for some time now, and it's a higher-level abstraction of the Hadoop framework that allows anyone to make queries against data stored in a Hadoop cluster just as if they were manipulating a conventional data store. It amplifies the reach of Hadoop, making it more familiar for BI users.
PIG
PIG is another bridge that tries to bring Hadoop closer to the realities of developers and business users, similar to Hive. Unlike Hive, however, PIG consists of a "Perl-like" language that allows for query execution over data stored on a Hadoop cluster, instead of a "SQL-like" language. PIG was developed by Yahoo!, and, just like Hive, has also been made fully open source.
WibiData
WibiData is a combination of web analytics with Hadoop, being built on top of HBase, which is itself a database layer on top of Hadoop. It allows web sites to better explore and work with their user data, enabling real-time responses to user behavior, such as serving personalized content, recommendations and decisions.
PLATFORA
Perhaps the greatest limitation of Hadoop is that it is a very low-level implementation of MapReduce, requiring extensive developer knowledge to operate. Between preparing, testing and running jobs, a full cycle can take hours, eliminating the interactivity that users enjoyed with conventional databases. PLATFORA is a platform that turns users' queries into Hadoop jobs automatically, thus creating an abstraction layer that anyone can exploit to simplify and organize datasets stored in Hadoop.
Storage Technologies
As the data volumes grow, so does the need for efficient and effective storage techniques. The main evolutions in this space are related to data compression and storage virtualization.
SkyTree
SkyTree is a high-performance machine learning and data analytics platform focused specifically on handling Big Data. Machine learning, in turn, is an essential part of Big Data, since the massive data volumes make manual exploration, or even conventional automated exploration methods unfeasible or too expensive.

Big Data in the cloud

As we can see from Dr. Kaur's roundup above, most, if not all, of these technologies are closely associated with the cloud. Most cloud vendors are already offering hosted Hadoop clusters that can be scaled on demand according to their users' needs. Also, many of the products and platforms mentioned are either entirely cloud-based or have cloud versions themselves.
Big Data and cloud computing go hand-in-hand. Cloud computing enables companies of all sizes to get more value from their data than ever before, by enabling blazing-fast analytics at a fraction of previous costs. This, in turn, drives companies to acquire and store even more data, creating more need for processing power and driving a virtuous circle.

Monday, April 28, 2014

What is NewSQL

by Unknown  |  in DI at  2:42 AM
I explained the importance of NoSQL in my previous post; today I am going to explain NewSQL.



What is NewSQL ?
              As you will have understood from my previous post, traditional RDBMS databases are struggling to keep up with the new sets of unstructured, high-volume data and with the time required to make any change to their schema structures. NoSQL databases came along as an alternative option for businesses to store and process that data.

          NewSQL databases are a group of modern relational database management systems that seek to provide the same scalable performance as NoSQL systems for online transaction processing (read-write) workloads while still maintaining the ACID guarantees of a traditional database system.


Though NewSQL databases differ in their internal architectures and processing, they are all grouped under one umbrella and call themselves NewSQL databases.


Friday, April 25, 2014

NOSQL 101

by Unknown  |  in Big Data at  3:25 AM
I am sure most of you have heard the market buzzwords NoSQL, NewSQL, ... and these new terms often confuse our DW developers. Most of us just read about them on the web without understanding them.

During my seminars, users often ask me why they should move to this new technology when the current DBMS has been around for 25+ years, and how it is going to change the world.

Before I get into a discussion, let's just start from the basics.


What is NOSQL ?
NOSQL covers a wide variety of different database technologies (built with formats and languages such as JSON and Python) developed to address the volume of data we generate today (from sources like Facebook, Twitter, ...), the handling of unstructured data, agile methodology, and scalable processing needs.


NOSQL Database Types

Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.

Graph stores are used to store information about networks, such as social connections. Graph stores include Neo4J and HyperGraphDB.





Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or "key"), together with its value. Examples of key-value stores are Riak and Voldemort. Some key-value stores, such as Redis, allow each value to have a type, such as "integer", which adds functionality.

Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together, instead of rows.
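At its core, a key-value store exposes little more than get/put/delete by key. A minimal in-memory sketch (purely illustrative, not any particular product's API):

```python
# Every item is an opaque value addressed by a key; no schema, no joins.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.put("user:42", {"name": "Ada"})
```

The simplicity of this interface is exactly what makes key-value stores easy to partition and replicate across many machines.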


When Should I go for Nosql ?
                       I would say it depends on the nature of the data. If you have simple table structures, spreadsheets, delimited text or other such files, then you can stick with the current RDBMS databases, which require you to define the schema before processing it.

For data such as geo-spatial data, molecular modeling, or very complex and unstructured data, go for NoSQL. I am sure you would have spent hours and hours modeling such data into relational tables in the past; NoSQL databases allow you to store it without defining a schema and to make any changes without changing the existing model.


Relational databases require you to define schemas  before you can add data. For example, you might want to store data about your customers such as phone numbers, first and last name, address, city and state – a SQL database needs to know what you are storing in advance.


NoSQL databases are built to allow the insertion of data without a predefined schema. That makes it easy to make significant application changes in real-time, without worrying about service interruptions – which means development is faster, code integration is more reliable, and less database administrator time is needed.
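A quick illustration of schema-less insertion, using a plain Python list as a stand-in for a document collection (illustrative only):

```python
# Documents in one collection need not share fields; new fields can
# appear at any time without a migration.
collection = []
collection.append({"name": "Ada", "city": "London"})
collection.append({"name": "Alan", "phone": "555-0101", "tags": ["cs"]})  # new fields, no schema change

names = [doc["name"] for doc in collection]  # ['Ada', 'Alan']
```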


Data warehousing & BI analytics
RDBMSes are ideally suited for complex query and analysis. Even in today’s world, Hadoop data is sometimes loaded back to an RDBMS for reporting purposes. So an RDBMS is a good choice if the query and reporting needs are very critical.

Real time analytics for operational data is better suited to a NoSQL setting. Further, in cases where data is brought together from many upstream systems to build an application (not just reporting), NoSQL is a must. Today, BI tool-support for NoSQL is new, but growing rapidly. 



Thursday, April 24, 2014

Big Data 101

by Unknown  |  in Big Data at  4:11 AM
Big data has created a significant shift in enterprise technology and stands to transform
much of what the modern enterprise is today. Digital data is everywhere and global data is growing at
40% per year; 90% of the data has been created in the past two years alone. Companies capture trillions of bytes of information about their customers, suppliers,
and operations, and millions of networked sensors are being embedded in the physical world in
devices such as mobile phones, energy meters and automobiles, sensing, creating, and communicating
data.

This data comes from everywhere: from sensors used to gather climate information,
banking transactions, financial market data, transaction records of online purchases,
 posts to social media sites, digital pictures and videos posted online, and from
cell phone GPS signals to name a few.


But what exactly is big data? Big data encompasses a complex and large set of diverse structured
and unstructured datasets that are difficult to process using traditional data management practices
and tools.



Big data is too big, moves too fast, or doesn't fit the structures of your database architectures.

Big data spans three V's:

Variety – Big data extends beyond structured data (OLTP systems) to include unstructured data of all varieties: text, audio, video, click streams, log files and more.
Velocity – Often time-sensitive, big data must be used as it is streaming in to the enterprise in order to maximize its value to the business.
Volume – Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.


VARIETY
Structured
Unstructured
Semistructured
All the above

VELOCITY
Batch
Near time
Real time
Streams


VOLUME 
Terabytes, Petabytes
Records
Transactions
Tables, files


Why Big Data

Big data is more than a challenge;
It is an opportunity
To understand trends and consumer sentiments in real time.
To find insight in new and emerging types of data.
To make your business more agile &
To answer questions that, in the past, were beyond reach.

Big Data Processing - Types

Big Data Processing Is Uniquely Suited For Some Types Of Data
Traditional object-relational techniques, supplemented by VLDB technology, will continue to meet most data management needs. Organizations must examine potential size and growth of data sources closely when evaluating new usage opportunities. Besides the potential to grow large, certain characteristics of data make it suitable for horizontally scalable, distributed processing:
Poorly structured, lightly structured, or unstructured data. Big data processing technologies are particularly well-suited to large volumes of lightly structured data such as web pages, blogs, and messaging protocols (email, instant messaging, and microblogs). This type of data adapts well to a hierarchical schema with sparsely populated attributes, which is the basis of many big data advancements driven by Web 2.0 companies such as Google, Facebook, and Yahoo.
Simply structured data streams. Data generated from a sensor network, such as RFID or medical equipment, can be accessed as a stream of data representing a simple structure of measured values. Distributed processing is well-suited to handling sensor network traffic, making big data processing technology a natural extension.
Binary or character encoded file data. Images and audio files are often best represented by a hierarchical database schema in which individual records are stored as objects. The structure of these systems closely parallels that of distributed networks, making this type of data a good fit for big data processing. Social networking and search companies are at the forefront of this usage scenario.

Thursday, April 17, 2014

Big Data and Hadoop Questions and Answers

by Unknown  |  in Q&A at  3:00 AM

What is Big Data?
Big data is data that exceeds the processing capacity of traditional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.

What is NoSQL?
NoSQL is a whole new way of thinking about a database. NoSQL is not a relational database. The reality is that a relational database model may not be the best solution for all situations. The easiest way to think of NoSQL is as a database which does not adhere to the traditional relational database management system (RDBMS) structure. Sometimes you will also see it referred to as 'not only SQL'.

We already have SQL, so why NoSQL?
NoSQL is high performance with high availability, and offers rich query language and easy scalability.
NoSQL is gaining momentum, and is supported by Hadoop, MongoDB and others. The NoSQL Database site is a good reference for someone looking for more information.

What is Hadoop and where did Hadoop come from?
By Mike Olson: The underlying technology was invented by Google back in their earlier days so they could usefully index all the rich textural and structural information they were collecting, and then present meaningful and actionable results to users. There was nothing on the market that would let them do that, so they built their own platform. Google’s innovations were incorporated into Nutch, an open source project, and Hadoop was later spun-off from that. Yahoo has played a key role developing Hadoop for enterprise applications.

What problems can Hadoop solve?
By Mike Olson: The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn’t fit nicely into tables. It’s for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That’s exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms. 

What is the Difference between Hadoop and Apache Hadoop?
There is no difference. Hadoop, formally called Apache Hadoop, is an Apache Software Foundation project.

What is the difference between SQL and NoSQL?

Why would NoSQL be better than using a SQL Database? And how much better is it?
It would be better when your site needs to scale so massively that the best RDBMS running on the best hardware you can afford and optimized as much as possible simply can't keep up with the load. How much better it is depends on the specific use case (lots of update activity combined with lots of joins is very hard on "traditional" RDBMSs) - could well be a factor of 1000 in extreme cases.

Name the modes in which Hadoop can run?
Hadoop can be run in one of three modes:
i. Standalone (or local) mode
ii. Pseudo-distributed mode
iii. Fully distributed mode

What do you understand by Standalone (or local) mode?
There are no daemons running and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.

What is Pseudo-distributed mode?
The Hadoop daemons run on the local machine, thus simulating a cluster on a small scale.

What does /var/hadoop/pids do?
It stores the PID files of the Hadoop daemons.

What is the full form of HDFS?
Hadoop Distributed File System

What is the idea behind HDFS?
HDFS is built around the idea that the most efficient approach to storing data for processing is to optimize it for a write-once, read-many access pattern.

Where does HDFS fail?
It cannot support a large number of small files, as the file system metadata increases with every new file; hence it is not able to scale to billions of files. This file system metadata is loaded into memory, and since memory is limited, so is the number of files supported.
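A back-of-envelope calculation makes the limit concrete, assuming roughly 150 bytes of namenode heap per namespace object (a commonly cited rule of thumb, not an exact figure):

```python
# One inode plus its blocks per file, ~150 bytes of heap per object.
BYTES_PER_OBJECT = 150

def namenode_heap_gb(num_files, blocks_per_file=1):
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT / 1e9

heap = namenode_heap_gb(100_000_000)  # 100M single-block files -> 30.0 GB
```

A billion small files would already need hundreds of gigabytes of heap on a single machine, which is why packing small files into archives matters.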

What are the ways of backing up the filesystem metadata?
There are 2 ways of backing up the filesystem metadata which maps different filenames with their data stored as different blocks on various data nodes:
Writing the filesystem metadata persistently onto a local disk as well as on a remote NFS mount.
Running a secondary namenode.

What is Namenode in Hadoop?
Namenode is the node which stores the filesystem metadata i.e. which file maps to what block locations and which blocks are stored on which datanode.
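A toy sketch of the two mappings the namenode maintains (the file, block, and datanode names are invented):

```python
# filename -> block ids, and block id -> datanodes holding replicas
file_to_blocks = {"/logs/day1.log": ["blk_1", "blk_2"]}
block_to_nodes = {
    "blk_1": ["dn-a", "dn-b", "dn-c"],
    "blk_2": ["dn-b", "dn-c", "dn-d"],
}

def locate(path):
    # return the datanodes to contact, in block order, for a file
    return [block_to_nodes[b] for b in file_to_blocks[path]]
```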

What is DataNode in Hadoop?
DataNode is the node which stores the actual data blocks of the files in HDFS and serves read and write requests from clients, as directed by the Namenode.

What is Secondary NameNode?
The Secondary NameNode (SNN) is an assistant daemon for monitoring the state of the cluster's HDFS. Like the NameNode, each cluster has one SNN, and it typically resides on its own machine as well.

What is JobTracker in Hadoop?
The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack. 

What are the functions of JobTracker in Hadoop?
Once you submit your code to your cluster, the JobTracker determines the execution plan by determining which files to process, assigns nodes to different tasks, and monitors all tasks as they are running. 
If a task fails, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefined limit of retries.
There is only one JobTracker daemon per Hadoop cluster. It is typically run on a server as a master node of the cluster.
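The retry behavior described above can be sketched as a simple loop (the function names and retry limit here are illustrative, not Hadoop's actual API):

```python
def run_with_retries(task, max_attempts=4):
    # relaunch a failed task up to a fixed number of attempts
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except RuntimeError:
            continue  # relaunch, possibly on another node
    raise RuntimeError("task failed after %d attempts" % max_attempts)

def flaky(attempt):
    # fails twice (e.g. a lost node), then succeeds on the third attempt
    if attempt < 3:
        raise RuntimeError("node lost")
    return "done"

result = run_with_retries(flaky)
```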

What is MapReduce in Hadoop?
Hadoop MapReduce (Hadoop Map/Reduce) is a software framework for distributed processing of large data sets on compute clusters of commodity hardware. It is a sub-project of the Apache Hadoop project. The framework takes care of scheduling tasks, monitoring them and re-executing any failed tasks. 

What are the Hadoop configuration files?
1. hdfs-site.xml
2. core-site.xml
3. mapred-site.xml
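For reference, each of these files holds key/value properties. A minimal core-site.xml fragment might look like this (the host and port are placeholders):

```xml
<!-- core-site.xml: the default filesystem URI -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:9000</value>
  </property>
</configuration>
```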

Monday, July 15, 2013

Hadoop Tutorial: Intro to HDFS

by Unknown  |  in Other at  1:14 PM

© Copyright 2015 Big Data - DW & BI.